Proof of concept of the potential of a machine learning algorithm to extract new information from conventional SARS-CoV-2 rRT-PCR results

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been and remains one of the major challenges modern society has faced thus far. Over the past few months, large amounts of information have been collected that are only now beginning to be assimilated. In the present work, the existence of residual information in the massive numbers of rRT-PCRs that tested positive out of the almost half a million tests that were performed during the pandemic is investigated. This residual information is believed to be highly related to a pattern in the number of cycles that are necessary to detect positive samples as such. Thus, a database of more than 20,000 positive samples was collected, and two supervised classification algorithms (a support vector machine and a neural network) were trained to temporally locate each sample based solely and exclusively on the number of cycles determined in the rRT-PCR of each individual. Overall, this study suggests that there is valuable residual information in the rRT-PCR positive samples that can be used to identify patterns in the development of the SARS-CoV-2 pandemic. The successful application of supervised classification algorithms to detect these patterns demonstrates the potential of machine learning techniques to aid in understanding the spread of the virus and its variants.

www.nature.com/scientificreports/ indicates the number of PCR cycles required to reach the threshold level of fluorescence associated with a positive result. Hence, the Ct value is inversely proportional to the viral load, although this correlation is not linear and depends on many factors. A recent work reviewed several works published on the connection of Ct with patient conditions and clinical outcomes 7 .
In addition, mutations that affect the template sequence where the primers bind can affect the amplification efficiency and, therefore, the Ct values for a specific gene. For SARS-CoV-2, increases in Ct values or nondetection of different genes associated with specific mutations of the virus have been described [8][9][10] . In fact, variant B.1.1.7 can be efficiently identified with specific tests through an undetectable S gene target or significant increase in the Ct value of the S gene compared to other targets [11][12][13] . Nevertheless, a significant delay of the Ct to N gene was observed in another test 14 , which confirms that the differences in Ct values between genes depend on the virus mutations but also on the primers used and, therefore, could be considered assay-dependent.
The mutation rates of coronavirus are low in general. However, in the case of SARS-CoV-2, the numbers reached in the pandemic have led to the accumulation of mutations and the emergence of multiple lineage and variants. Some of them are classified as variants of interest (VOIs), and variants of concern (VOCs) [15][16][17] , which have relevant epidemiological characteristics that may affect the virus's properties, spreads, clinical characteristics of disease, and vaccine and drug performance. For this reason, it is important to track known variants and implement surveillance systems capable of detecting significant changes in the predominant variants, the variants causing an outbreak or even the emergence of a variant of high consequence (VOHC). The gold standard for identifying and tracking variants in circulation is the whole genome sequencing, although it has important limitations related to costs, resource availability, lack of expertise, standardization, and time delay.
Currently, machine learning or deep learning models are being used in the medical field and, more specifically, in the field of COVID, even as a proof of concept like this work 18,19 . In particular, several works have developed classification models based on X-ray images [20][21][22][23] . Beyond image-based models, several works have addressed information from blood 23 or medical information 19,24 . More specifically, different works present a model established to predict COVID diagnose 25,26 and PCR results 27 based on clinical information alone.
There are two main approaches: unsupervised learning and supervised learning. In unsupervised learning, we should allow the algorithm to seek its own way to classify the data; however, for the supervised approach, we must give the algorithm a target set of clusters. In this work, we opted for the latter.
The Microbiology Department of the University Hospital of Vigo (Complexo Hospitalario Universitario de Vigo, CHUVI), was a pioneer in the use of pooling in saliva for detecting SAR-CoV-2 in a nonsymptomatic population 28 ; and over the pandemic, the 'pooling lab' has screened more than 750,000 individual samples by pooling. When we analyzed the results of the positive samples in pools versus individuals, we observed how the relationship between Ct values remained constant for each sample despite the increase caused by dilution. These relationships between Ct values could be gathered into similar profiles, and some groups seemed to have temporal accumulation. These findings and those explained in the previous paragraph made us consider whether there is an underlying 'signature' or 'pattern' in the Ct results of an rRT-PCR that can be used for the classification of samples.
The main goal of this work is to assess whether there is a Ct pattern that is characteristic of virus variants. Since ML algorithms require a high volume of training data to be effective and genome sequencing for variant determination was impractical, we decided to use the wave index as the characteristic target cluster. Hence, the working hypothesis is the following: each wave has a distinguishable pattern (signature) on the rRT-PCR results that allow an ML algorithm to efficiently predict the wave to which each individual test belongs.
Due to limitations in the number of tests completely sequenced, we will train our algorithm to predict the waves inside the evolution of the pandemic and then compare this prediction with the arrival and predominance of the different virus variants in the area where this study was conducted. By creating a large database to efficiently train a classification algorithm, we show that there is an underlying signature response in rRT-PCR results probably related to variations in viral genome.
The aim of this study is to demonstrate the presence of a distinguishable signature in the Ct pattern of rRT-PCR. However, it is important to note that any further generalizations should only be made once the results are confirmed through various labs, test conditions, and sequenced samples in which the variant is already determined. Until such verification is obtained, it would be premature to draw further conclusions.

Methods
Brief description of the rRT-PCR technique employed, primers, and software. We performed nucleic acid extraction in a MicrolabStarlet IVD platform using the STARMag 96 × 4 Universal Cartridge Kit (Seegene Inc, Seoul, South Korea). To detect SARS-CoV-2, we applied the Allplex™ SARS-CoV-2 Assay (Seegene Inc, Seoul, South Korea), a multiplex one-step rRT-PCR able to simultaneously detect four viral targets, including the structural protein envelope (E) gene, the RNA-dependent RNA polymerase (RdRP) gene, the spike (S) gene, the nucleocapsid (N) gene, and an exogenous RNA-based internal control (IC). This rRT-PCR step was run on a CFX96™ system (Bio-Rad Laboratories, Hercules, CA, USA), and the analysis was performed using Seegene Viewer-specific SARS-CoV-2 software (Seegene Inc, Seoul, South Korea), resulting in separate cycle threshold (Ct) values for the E and N genes and one combined Ct value for the RdRp and S genes (RdRp/S) in the FAM, Cal Red 610 and Quasar 670 channels, respectively. The HEX channel is used for internal control. Regarding interpretation of the results, according to the manufacturer's instructions, Cts values ≤ 40 are considered detected, and Cts value > 40 or not applicable (N/A) are considered not detected.
Ethics. The  www.nature.com/scientificreports/ according to the relevant guidelines and regulations, and patient data were anonymized. The dataset used and the waiver of informed consent was approved by Galician network of committees of research ethics.
Description of the dataset employed in this work. Positive samples from two different sources were used for this study. First, 3274 positive samples were obtained from the 688,763 samples processed by the pooling techniques between August 2020 and July 2021. Second, we also included 17,144 positive samples obtained from the 313,939 samples processed in the microbiology laboratory between February 2020 and March 2021. The samples processed in the 'pooling lab' are screenings to detect SAR-CoV-2 in a non-symptomatic population. The participants were asked to collect saliva (self-sampling) in TRANSPORT MEDIUM-2 (Vircell® Ref: TM013) immediately after waking up, following the manufacturer's instructions. Although each result pertains to an individual rRT-PCR for each sample, these samples were first flagged as possibly positive by group testing. Then, the original samples from the positive pool are individually analyzed. These are the results that are used here. Individual samples and pools were analyzed following the same standard rRT-PCR protocol described in "Brief description of the rRT-PCR technique employed, primers, and software" section.
The other samples were nasopharyngeal swabs processed individually in the laboratory of the CHUVI Microbiology Department as part of the assistance routine for SARS-CoV-2 diagnosis. It is important to note that the supply of this source of positive samples ended prematurely in March 2021 due to the need to change the reagent used in this laboratory (Allplex™ SARS-CoV-2 Assay to Allplex™ SARS-CoV-2/Flu A/Flu B/RSV assay, both from Segene Inc.) because of the high demand. In this way, we were able to keep the Allplex™ SARS-CoV-2 assay to group testing, since in this case, an assay change requires a full re-evaluation of the system, and the increase in the Cts for the N gene previously described by Wollschäger et al. 14 may have greater significance in group testing. As explained in "Classification results" section, the data from 12,313 positive samples obtained by the Allplex™ SARS-CoV-2/Flu A/Flu B/RSV assay between February and August 2021 could not be included in the present study.
Characterization of the wave concept. Since the pandemic began in March 2020, the concurrent increases and decreases in cases have been linked to the concept of 'wave' , which are determined using subjective, unofficial criteria. To the best of the authors' knowledge, this is an abstract nomenclature whose rigorous definition has not yet been clearly established to date. To characterize the pandemic dynamics in our area, we tracked the curve of active cases at the level of Galicia and, more specifically, Vigo and determined the boundaries between the so-called 'waves' in a data-driven way.
The database of active cases in the entire Galician region during the SARS-CoV-2 pandemic was obtained from data provided by the public health service of the Autonomous Spanish Community. In order to determine the time limits of each wave, the contagion curve is fit to a smoothed spline (R 2 = 0.99), and the waves are defined by the local minima and maxima of the curve, as shown in Fig. 1. Therefore, it can be concluded that although vaguely defined, waves are quite distinguishable, and the number of samples is inherently higher near the peak of each wave and much lower in their frontiers. Additionally, each new wave could also be associated with a higher proportion of samples with lower Ct values at the beginning 29 . www.nature.com/scientificreports/ Focusing now on the data available for this study, Fig. 2 shows how the existence of the waves is reduced to four clearly differentiated peaks. First, the slight increase in the number of cases experienced in autumn of 2020 is not seen in the data collected. The peak of cases in the second wave is concentrated in the last months of the year.
As will be explained in this work, the key aspect is the capacity of the machine learning algorithm to correctly predict the wave to which each sample belongs based on the numerical results of rRT-PCR for each gene. Therefore, a change in the target genes of the PCR performed during the fourth wave is too strong to indicate the temporal position of those tests and therefore had to be removed from our database to avoid giving an unfair advantage to the algorithm. Unluckily, the dire circumstances under which laboratories had to work during the pandemic led to this type of disturbance. Fortunately, in our case, it only significantly affected the fourth wave.
Descriptive analysis of the database used in the work. Even after extracting the samples that could lead to unfair results, the resulting database used in this study corresponds to a set of 20,418 PCR samples collected by the Microbiology Department of the CHUVI from March 2020 to July 2021. For each sample that tested positive for SARS-CoV-2 the database included an anonymized identification number, the date when the sample was taken, the threshold value for each target gene (E, N and RdRP/S) and the threshold value for the internal control (IC). The RdRP and S genes share the same channel; therefore, we obtained a single Ct value for both genes. Figure 3 shows the distribution of the number of cycles from the analyzed gene profiles, where the average Ct value is approximately 26 for genes E and N and close to 28 for the combination of genes RDRP/S. Some visual features arise from the simple analysis of the data csollected. As seen in Fig. 4, the RdRP/S gene distribution seems to be slightly offset toward higher Ct values; and in fact, a more abrupt end is shown. However, a strong linear relationship between the number of cycles of the three genes can be observed from the database (R 2 E-N = 0.96, R 2 R-N = 0.95, and R 2 E-R = 0.97). This is anticipated since the presence of the genes is expected to be similar and each number of cycles is allegedly related to the viral load of the individual; thus, the values of the numbers of cycles detected in any sample are usually quite close. Figure 5 shows the temporal evolution of the number of cycles of each gene during the pandemic. The figure clearly shows that, at least at first glance, there is no trend or time evolution that points to Ct differentiation over time.

Classification techniques employed.
To cluster the samples, a supervised learning technique would allow predicting the membership of a sample to a wave based simply on the number of cycles presented as a result of the PCR. Supervised learning algorithms are based on using labeled input data, i.e., with a correct answer with reference to its classification. Thus, as the algorithm is trained, it compares its predicted output with the correct input response until the error in its decision is minimized. In this work, we have labeled each sample with the wave that was dominant at the time the sample was taken from the individual. Since the definition of each wave, described in "Characterization of the wave concept" section, is unique, there is no possible ambiguity on the wave that we assigned to each sample. However, when a new wave becomes dominant, that means that, for the reasons discussed later, there is are new mechanisms in the pandemic progression that become dominant over the receding conditions of the previous wave. Hence, there is an intrinsic overlap that, by the methodology employed, cannot be resolved. Instead, the wave assigned to each sample is simply the dominant wave at that specific date.
Considering that a supervised algorithm was chosen to classify the waves, it was decided to confront two classic approaches within the machine learning field: a support vector machine (SVM) and a neural network (NN), using MATLAB R2021a as the main tool for the model's development and post-processing the results. The fundamentals of each of the classification techniques tested are completely different although their final performance, as will be shown, is similar. In the NN approach, the model learns according to the training strategy and adjusts the weights of each of the neurons towards the optimum, whereas in the SVM approach a maximum margin hyperplane is created by means of kernel functions that allow the increase of dimension and thus facilitates the classification task. A detailed mathematical description of both models utilized in this work, SVM and NN, can be found in the Supplementary material. Structure of the model. Figure 6 includes the information detailing the several steps that constitute the entire ML pipeline. First, the number of cycles of each of the genes for a single sample and an additional number of cycles corresponding to the internal control are taken as input parameters. Then, the training process starts after the machine learning algorithm is chosen.
The output of the algorithm corresponds to a confidence score that represents the probability that a sample belongs to a particular group. Considering that the main objective is to predict the membership of a sample to a wave, the output of this algorithm will correspond to a confidence level associated with the probability that a sample belongs to a wave. Thus, the wave with the highest confidence level assigned to it will be the one chosen as the predicted wave.
Subsequently, the prediction will be compared with the real wave value assigned to the sample. If the prediction coincides with the real value, it represents a good estimating; conversely, it represents an incorrect classification. The actual wave value of the sample is determined from the date of the sample taken and the estimated cutoffs with the approximation of the wave of active cases to a spline. www.nature.com/scientificreports/

Results
This section will discuss the results obtained during the present investigation. The section will begin with the characteristics of the algorithms used, then address choosing the best alternative by evaluating the outcome and end with some conclusions drawn from the analysis.

Metrics.
First, for the training of both models, a technique called cross-validation was utilized to avoid overfitting (i.e., the situation in which the network overlearns and extracts some noise as the main structure of the data, which affects the generalization of the predictions that tend to be incorrect). The idea behind cross-validation is to divide the database into a number of randomly chosen partitions, usually balanced, in order to train the model on subsets that it has not seen before. Thus, in this case, the database was divided into five parts. Four parts were used for training and the remaining part was used for testing. Then, the part used for validation was alternated as each training step was completed.
The criteria used to compare the results of the supervised algorithms was the accuracy together with the detailed information drawn from the confusion matrix. Considering that cross-validation was utilized, the accuracy was calculated as the percentage of observations correctly classified considering only the number of samples held out for validation in each training segment.  www.nature.com/scientificreports/ Among the diverse metrics used to evaluate a model, the confusion matrix is often used to evaluate the results of a model and identify its "weak points". In this matrix, the diagonal shows the percentage of correctly identified results (i.e., second waves identified as such), and the off-diagonal elements show the failure rates.
Furthermore, the confusion matrix also shows the true positive rate (i.e., the TPR) and false negative rate (i.e., the FNR) in the right-hand columns. The TPR is the proportion of samples correctly classified with respect to their true class, and the FNR is the proportion of samples incorrectly classified with respect to their true class.
Overview of the performances of the SVM and NN. The classifications made with both algorithms obtained similar results. This can be seen in the accuracy results in Table 1. This similarity was expected from the linearity observed in the data, which led to equivalent results for both approaches. The cases in which the SVM  Using as input data the number of cycles of each gene, and the number of cycles of the internal control (Ct IC), the Machine Learning models were trained to obtain a score with reference to the probability of each sample to be a member of every wave. In the example depicted, the score of the first wave is the highest and, therefore, this wave is chosen as the result of the prediction; this prediction is then compared with the wave covering the date when the sample was taken, and such comparison produces a true (coincident) or false (difference) result of the classification. www.nature.com/scientificreports/ shows a real improvement over the other techniques are usually related to nonlinear trends among the original data, which was not found in the present study. Table 1 includes the model characteristics for the SVM, following a one-versus-one multiclass method with no standardization of the data; and for the NN, which has a unique hidden layer with 100 neurons and uses the ReLU activation function. It is important to note that despite the large difference in training times between the two algorithms, their accuracies are practically identical. Figure 7a and b include the confusion matrices for the algorithms, where the blue color is related to responses that are correct and the orange tone is related to misclassified samples. The fourth wave is the most clearly identified, with an accuracy (micro averaged) of over 93% in both cases. However, the time frame between the second and third waves, as well as the transition from the first to the second, causes more confusion in the algorithms. As seen in the TPR column, the lowest hit percentage corresponds to the first wave in the case of the SVM (Fig. 7a) and to the third wave in the case of the NN (Fig. 7b).
Classification results. When discussing the results, the outcome offered by the SVM is considered since they showed quite similar results and the SVM is a model that has certain advantages over the neural network in terms of performance and extrapolation of results. One such advantage is the opportunity to avoid retraining the model with the input of new samples, which is required in the case of the neural network.
As seen in Fig. 8, the majority of individuals are classified within their wave, and those points (located in the middle-bottom area of the figure) that have been misclassified are usually assigned to the upcoming or preceding wave.
An example of this confusion can be seen clearly between the second and third waves, where the yellow dots (individuals from the second wave that have been identified as third) can be identified within the second wave region. This can also be noticed, to a lesser extent, in the section of the first wave where a small cloud of green dots is located in the lower area (that is, real first one individual wrongly predicted by the classifier as second wave samples).
Moreover, Fig. 9 shows how the increase in failures is related to a higher uncertainty in the solution obtained by the classifier. It should be considered that since the classifier has four possible answers, if a decision is made with a score close to 0.25, it means that all four options are necessarily similar in terms of score level. In contrast, those answers obtaining greater certainty (i.e., with a confidence level close to unity), correspond to a higher number of correct classifications and almost no confusion. Hence, the algorithm seems to be well aware of its real performance.
It is worth mentioning that since the ground truth for each sample is based only on the date of their tests, which in the end is converted into a class membership (wave), the overlapping waves and the underlying causes that will be discussed later can easily be explained the fact that some (maybe many) of the individuals misclassified by ML could be in fact individuals misclassified to their assigned waves.
In fact, the algorithm precision tends to be higher at wave centers where the sample characterization is more solid. In contrast, those points that are located near the frontiers of the waves tend to be more conflicting (which can be easily seen, for example, in the intersection between the second and third waves in Fig. 8).

Discussion
Based on the results obtained in the classification, a deep reason to justify how a mathematical model, using only the detected number of cycles of each gene together with the number of cycles detected by the internal control as input data, could have such a high and accurate precision rate for each wave (as shown in the gray areas of Fig. 10) was sought.
Starting with the results corresponding to the first wave (Fig. 10a), the accuracy area (whose maximum reaches 80% at the peak of the wave) coincides temporally with the appearance of variant 20A in the Galician region. More specifically, the end of the wave also coincides temporally with the peak and decrease in active variant 20B cases (dashed line in the yellow area). This indicates that in addition to being characterized by variant 20A, this wave also picks up some features of variant 20B that cause a spike in the failure rate in October 2020 and again in July 2021.
In the case of the second wave (Fig. 10b), the tendency of the accuracy rate to follow the presence of specific variants still persists with the appearance of 20E (EU1) at the start of the second wave. Even so, this wave shows a higher error rate prior to its beginning and once it has ended since the 20B variant is present simultaneously  (Fig. 10c), once again, its characterized by the presence of the 20E(EU1) variant together with 20I. This again causes a certain spike in the error rate due to the presence of these same variants during the rest of the waves. However, it is clearly observed that a higher accuracy corresponds to the coincidence of a higher percentage of cases of these variants simultaneously.
Finally, the fourth wave (Fig. 10d) is more clearly identified than the rest because the 21A variant is only present during the summer months of 2021. This means that the failure rate remains practically null until the arrival of this wave in April 2021.

Limitations.
The main limitation of this study is that it is applied only to the data from a single lab and area. This is justified by the fact that the dataset required for training had to avoid any trace of data that could lead the algorithm to distinguish data from other data. In fact, as mentioned above, this reduced the size of our dataset since we had to discard many tests due to the use of a different set of target genes during a specific period of time. The purpose of this work is to show that a distinguishable signature on the Ct pattern seems to exist, but until proven using different labs, test conditions, etc., no further generalization should be made than the mere exist- This percentage can also be broken down for each of the waves, showing for example that 23.3% of the sample predictions that really belonged to the first wave were assigned to the second wave. The color coding corresponds to orange shades for failures and blue shades for successes. www.nature.com/scientificreports/ Figure 8. SVM classification result for the data utilized, divided by waves, highlighting the predicted wave class assigned by the algorithm to the sample using color. The continuous black line shows the evolution of the number of infections throughout the pandemic (i.e., the number of active cases) and the scattered points in time show the average probability with which the classifier made the decision each day. That is, the average score is a ratio that shows the probability with which the classifier assigns a wave to each individual.

Figure 9.
Ratio as a function of the confidence intervals for the SVM results. The bars correspond to the confidence levels with which the algorithm made the decision to complete its classification, in a range between 0 and 1, adding all the confidence values given to each of the waves to 1. Values close to unity correspond to a higher confidence by the algorithm in making the decision, and vice versa for values close to 0. The right vertical axis shows the ratio of the ratio of successful (blue) or failed (red) samples to the total number of samples belonging to each confidence level. Finally, the shaded area refers to the number of samples that are concentrated at each confidence level, with most of the samples being between the confidence levels of 0.7 and 1. www.nature.com/scientificreports/ Figure 10. Accuracy compared to the prevalence of variants throughout the pandemic divided into the four waves (a-c and d respectively). The grayish area containing the wave number corresponds to the ratio of samples correctly classified within its wave, joined by a solid red line representing the ratio of sample misclassifications. On the other hand, the colored areas correspond to the occurrence of the variants being: blue (20A), yellow (20B), green (20E), red (20I) and violet (21A). It is important to mention that the data on the occurrence of the variants in the region of Galicia has been obtained from nextstrain.org.

Assessment of the potential interest of the proven concept. The ML tool proposed in this work
represents an additional tool that can improve the relevance of rRT-PCR results. The classical interpretation of a qualitative, Boolean result (positive or negative) can be completed with additional information regarding the estimation of probable virus variants (if recognized by an ML algorithm). This can be useful for many purposes, such as the following: pandemic control (quick detection of the arrival of new variants), as a screening tool for virus sequencing, as a quality check of the tests and/or reagents, etc. In addition, this work is just the initial step toward a completely new methodology applied to rRT-PCR not only in the case of SARS-CoV-2 but also in many other diseases.

Conclusions
In this work, we found that an ML algorithm trained with a sufficiently rich database can efficiently identify the moment in the SARS-CoV-2 pandemic when an individual was infected based on a simple, standard, rRT-PCR test with three channels (E, N and RdRP/S). No additional information regarding gender, age, condition, etc. was required by the algorithm. The subjacent reason for the precision of the ML algorithm seems to be an underlying characteristic signature of the main SARS-CoV-2 variants. Only by collecting a sufficient amount of data of different variants, individuals, tests, laboratories, etc. can the concept presented here be proven directly and not through the wave clustering concept employed here.
The results of this work can be a first step toward a new accessible and inexpensive surveillance method for tracking and/or selecting candidate samples for sequencing. Even with its limitations, this method may help monitor the changes to the virus and extend the surveillance to areas where current systems are scarcely implemented and have contributed significantly to the expansion of VOCs.

Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request. www.nature.com/scientificreports/